1 Abstract

This R Markdown document provides the analysis of the processed data for the paper entitled “Multivalent Antigen Display on Nanoparticles Diversifies B Cell Responses”. It covers the B cell receptor analysis, the plots, and the generation of intermediate files used for plotting with other programs. The processed files used here are AIRR-compliant datasets that are automatically downloaded from Zenodo (DOI: 00000000); they are either Human Respiratory Syncytial Virus-specific (RSV-specific) or full bulk repertoires from the vaccinated non-human primates (Macaca mulatta).

2 Needed libraries

The libraries loaded below need to be present; the snakemake command will try to install them into the conda environment.

library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)

3 Load data

Some of the data used here is stored in .rds format, but it can also be found as .tsv files on Zenodo (DOI: 000000000).

data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
  mutate(group = case_when(group == "NP 100%" ~ "20-mer",
                           group == "NP 50%" ~ "10-mer",
                           group == "Sol" ~ "1-mer",
                           group == "PostF" ~ "PostF"))

clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")


# replace column names for AIRR-compliant names and filter out "PV" timepoint which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
  filter(timepoint != "PV") %>%
  mutate(cdr3_aa_length = nchar(CDR3_aa),
         cdr3_aa = CDR3_aa,
         v_call = V_gene,
         j_call = J_gene,
         d_call = D_gene) 

# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
 
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))

4 Plasma profiling with Multidimensional Scaling

The non-human primate plasma samples from this study were used in antibody competition ELISAs, coated with pre-fusion or post-fusion protein, against previously characterized monoclonal antibodies. Multidimensional scaling (MDS) aims to preserve the distances between datapoints and/or samples across a set of variables: the closer two points are, the more similar their competition profiles are in this case. It is a good way to summarize a large number of variables in a 2D space. The MDS input is a dissimilarity matrix; for this plot the matrix is based on cosine distance. Although Euclidean distance is the most commonly used, we chose cosine distance here because the magnitude of the responses matters less to us than the competition profile itself. Some NHPs were very good responders and therefore had higher titers across all competitions, so Euclidean distance would place them far from all the other NHPs simply because of their high antibody titers. Since cosine distance measures the angle between profiles rather than the values themselves, it is insensitive to the overall magnitude of the antibody titers.
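As a toy illustration of this choice (made-up vectors, not study data), scaling a sample's titers leaves its cosine distance unchanged while inflating its Euclidean distance:

```r
# Toy example: cosine distance ignores magnitude, Euclidean does not.
cosine_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

a <- c(1, 2, 3)   # hypothetical competition profile
b <- 10 * a       # a "high responder" with the same profile, 10x the titers

cosine_dist(a, b)     # 0: same profile despite different magnitudes
sqrt(sum((a - b)^2))  # large: Euclidean distance driven by magnitude alone
```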

cosine_distance <- function(x) {
  # build the numeric matrix of competition titers
  Matrix <- x %>%
    select(-c(timepoints, ID, group)) %>%
    as.matrix()
  # cosine similarity matrix
  sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
  sim <- sim %*% t(sim)
  # convert to cosine dissimilarity matrix (distance matrix)
  D_sim <- as.dist(1 - sim)

  mds <- D_sim %>%
    cmdscale(3) %>%
    as_tibble()
  colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")

  mds <- mds %>%
    mutate(group = x$group,
           timepoints = x$timepoints,
           ID = x$ID)
  return(mds)
}

# calculate cosine distance and generate MDS with and without PostF
mds <- cosine_distance(data_comp)
mds_no_postf <- data_comp %>% filter(group != "PostF") %>% cosine_distance()
ls_mds <- list(mds, mds_no_postf)

for(f in seq_along(ls_mds)){
# Plot and color by groups
    mds_plot <- ls_mds[[f]]
    plot <- mds_plot %>%
      ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) + 
      geom_point(size = 4, shape = 21) +
      ggtitle(f) +
      scale_fill_manual(values = fill_col_values) +
      scale_color_manual(values = color_values) +
      lims(x = c(-.6,.5), y = c(-.4, .4)) + 
      ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
      theme(axis.ticks = element_line(size = .5),
            legend.position = c(.1,.9)) +
      facet_wrap(~timepoints)
    
    print(plot)
    ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 6, height = 4)
  }

5 Save metadata from B cell repertoire sequencing

Save a table containing read information from the high-throughput sequencing. It includes the number of raw reads per animal and the number of processed reads with high-quality assignment of IGHV and IGHJ genes, using the IgDiscover software as an IgBlast wrapper.

rep_seq_stat <-  lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <-  lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE) 
rep_seq_stat <- rep_seq_stat %>%
  select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))

write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )

6 Generate processed files for diversity index estimation

The lines that write intermediate files are commented out, so not all of them are saved; uncomment them if the intermediate files are needed. The intermediate files were used to run Recon (v2.5) (Kaplinsky & Arnaout, Nat Commun, 2016) with default parameters; running Recon is described further in the section Recon estimates for RSV-specific diversity.
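For reference, the count files written below for Recon are headerless, tab-separated tables with two columns: the clonal size and the number of clones observed at that size. A toy sketch with made-up values:

```r
# Toy Recon input count file: column 1 = clonal size,
# column 2 = number of clones observed at that size.
toy_counts <- data.frame(clonal_size = c(1, 2, 5),
                         size        = c(120, 30, 4))
# write.table(toy_counts, "toy_file_to_recon.txt",
#             sep = "\t", row.names = FALSE, col.names = FALSE)
sum(toy_counts$clonal_size * toy_counts$size)  # 200 sequences across 154 clones
```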

# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
  mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))

# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()

# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
  # save full clonotype files
  filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
    filter(ID_timepoint_spec == animals) %>%
    as.data.frame()
 # write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)

  # save summary files
  clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
  #write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
  }

# do the same for the total, i.e. without accounting for specificities (PreF, DP, PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
  # save summary files
  clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
    filter(ID_timepoint == animals) %>%
    as.data.frame() %>%
    mutate(
      specificity_group = "Total",
      ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
    ) %>%
    group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
    summarise(clonal_size = n(), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
 # write.csv(clonotype_summary_rsv[[paste0(animals, "_total")]], paste0("../", result.dir, animals, "_total_rsv_clonotype_summary.csv"), row.names = FALSE)
  }


# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()

for (animals in unique(clonotype_rsv$ID)) {
  # save summary files
  filtered_animal_rsv_summary[[animals]] <- clonotype_rsv %>%
    filter(ID == animals) %>%
    as.data.frame() %>%
    group_by(grp, timepoint, ID) %>%
    summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
    arrange(desc(clonal_size)) %>%
    ungroup()
  # write.csv(filtered_animal_rsv_summary[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}

to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
  select(clonal_size, ID_timepoint_spec) %>%
  group_by(clonal_size, ID_timepoint_spec) %>%
  summarise(size = n()) %>%
  ungroup()


for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
  to_recon_table <- to_recon_rsv %>%
    filter(ID_timepoint_spec == i) %>%
    select(clonal_size, size)

 # fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}

7 Calculating Species richness

7.1 Chao1 estimates for RSV-specific diversity

Species richness (Chao1) was calculated using the vegan R package. The samples were first subsampled 100 times for each animal/timepoint and the species richness was then estimated; the mean of those 100 replicates was used for plotting.
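For intuition, vegan::estimateR() reports a bias-corrected Chao1 estimator, which augments the observed richness with a term based on the singleton (f1) and doubleton (f2) counts. A hand computation on a made-up clone-size vector:

```r
# Bias-corrected Chao1: S_obs + f1*(f1 - 1) / (2*(f2 + 1))
clone_sizes <- c(5, 3, 2, 2, 1, 1, 1)  # hypothetical clonal sizes
s_obs <- length(clone_sizes)           # observed richness: 7 clones
f1 <- sum(clone_sizes == 1)            # singletons: 3
f2 <- sum(clone_sizes == 2)            # doubletons: 2
chao1 <- s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
chao1                                  # 7 + 6/6 = 8
```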

Here is the processing and estimation of species richness:

rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))


# subsample and replicate the subsampling 100 times for higher accuracy

chaox100 <- function(x, value_to_subsample) {
  replicate(100, {
    subsample <- vegan::rrarefy(x, value_to_subsample)
    chao <- vegan::estimateR(subsample)
    return(chao)
  })
}

diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))

# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
  print(spec_timepoint)
  rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
  min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
  chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
  names(x) <- gsub("_.*", "", names(x))
  x
}

# adjusted dataset for plotting
{
  chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
  chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
  chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
  # save intermediate file
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
  # diversity mean of x100 replicates per animal
  chao_results_df %>%
    filter(algorithm %in% c("Chao1", "ACE")) %>%
    dplyr::rename(Timepoint_specificity = .id) %>%
    group_by(algorithm, Timepoint_specificity) %>%
    summarise_all(.funs = mean) %>%
    write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)

  chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
    tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
    mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ))
}

7.2 Chao1 plotting between groups for RSV-specific diversity

Here is the plotting and comparison between vaccinated groups:

my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
                       c("Total_B2_20-mer", "Total_B2_1-mer"),
                       c("PreF_B1_20-mer", "PreF_B1_1-mer"),
                       c("PreF_B2_20-mer", "PreF_B2_1-mer"),
                       c("DP_B1_20-mer", "DP_B1_1-mer"),
                       c("DP_B2_20-mer", "DP_B2_1-mer"))
chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"

chao_results_df <- chao_results_df %>%
  filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
    Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
  summarise(value = mean(value)) %>%
  tidyr::drop_na() %>%
  ungroup()

write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)


stat.test <- chao_results_df %>%
              wilcox_test(formula = value ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y.   group1          group2         n1 n2 statistic     p p.adj p.adj.signif y.position groups      xmin xmax
value Total_B1_20-mer Total_B1_1-mer  5  4        16 0.190 0.285 ns             272.3352 Total_B1….     1    2
value Total_B2_20-mer Total_B2_1-mer  5  4        15 0.286 0.343 ns             301.8278 Total_B2….     3    4
value PreF_B1_20-mer  PreF_B1_1-mer   5  4         7 0.556 0.556 ns             331.3205 PreF_B1_….     5    6
value PreF_B2_20-mer  PreF_B2_1-mer   5  4         2 0.064 0.127 ns             360.8131 PreF_B2_….     7    8
value DP_B1_20-mer    DP_B1_1-mer     5  4        20 0.016 0.048 *              390.3058 DP_B1_20….     9   10
value DP_B2_20-mer    DP_B2_1-mer     5  4        20 0.016 0.048 *              419.7984 DP_B2_20….    11   12
chao_results_df %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "value", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 2) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Chao1)", x = "") +
  stat_summary(fun = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)

7.3 Recon estimates for RSV-specific diversity

Following Recon's default settings and tutorial (check the Recon manual), the count files generated previously were used to create the fit files, which in turn served as input for generating the diversity_table.txt for all the samples.

To generate the fit file for each count file, the bash script used in a for loop was:

#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE

python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt

Then, each fit file was used for generating the diversity_table.txt with the command:

python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt

The results from diversity_table.txt for all the samples were used as input for plotting.

recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
  mutate(Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
   Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  ungroup()

write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
  

stat.test <- recon_res %>%
              wilcox_test(formula = est_0.0D ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y.      group1          group2         n1 n2 statistic     p p.adj p.adj.signif y.position groups      xmin xmax
est_0.0D Total_B1_20-mer Total_B1_1-mer  5  4        16 0.190 0.228 ns             275.7082 Total_B1….     1    2
est_0.0D Total_B2_20-mer Total_B2_1-mer  5  4        16 0.190 0.228 ns             304.8120 Total_B2….     3    4
est_0.0D PreF_B1_20-mer  PreF_B1_1-mer   5  4         7 0.556 0.556 ns             333.9159 PreF_B1_….     5    6
est_0.0D PreF_B2_20-mer  PreF_B2_1-mer   5  4         2 0.064 0.127 ns             363.0197 PreF_B2_….     7    8
est_0.0D DP_B1_20-mer    DP_B1_1-mer     5  4        20 0.016 0.048 *              392.1236 DP_B1_20….     9   10
est_0.0D DP_B2_20-mer    DP_B2_1-mer     5  4        20 0.016 0.048 *              421.2274 DP_B2_20….    11   12
recon_res %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 2) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Species richness\n(Recon)", x = "") +
  stat_summary(fun = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)

7.4 Simpson's diversity (Recon)

Simpson's diversity (reported by Recon as est_2.0D) was taken from the same Recon diversity_table.txt.
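As background (a plug-in sketch on made-up clone frequencies, not Recon's model-fitting procedure), est_2.0D corresponds to the Hill number of order 2, i.e. the inverse Simpson index D2 = 1 / sum(p_i^2):

```r
# Hill number of order 2 (inverse Simpson) on a toy repertoire
clone_sizes <- c(50, 30, 10, 10)  # hypothetical clonal sizes
p <- clone_sizes / sum(clone_sizes)
d2 <- 1 / sum(p^2)
d2                                # 1 / 0.36: ~2.78 "effective clones"
```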

stat.test <- recon_res %>%
              wilcox_test(formula = est_2.0D ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y.      group1          group2         n1 n2 statistic     p p.adj p.adj.signif y.position groups      xmin xmax
est_2.0D Total_B1_20-mer Total_B1_1-mer  5  4        16 0.190 0.570 ns             38.90024 Total_B1….     1    2
est_2.0D Total_B2_20-mer Total_B2_1-mer  5  4        14 0.413 0.619 ns             42.92893 Total_B2….     3    4
est_2.0D PreF_B1_20-mer  PreF_B1_1-mer   5  4        13 0.556 0.667 ns             46.95762 PreF_B1_….     5    6
est_2.0D PreF_B2_20-mer  PreF_B2_1-mer   5  4        12 0.730 0.730 ns             50.98630 PreF_B2_….     7    8
est_2.0D DP_B1_20-mer    DP_B1_1-mer     5  4        15 0.286 0.572 ns             55.01499 DP_B1_20….     9   10
est_2.0D DP_B2_20-mer    DP_B2_1-mer     5  4        17 0.111 0.570 ns             59.04368 DP_B2_20….    11   12
recon_res %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "est_2.0D", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 2) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 45)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "Simpson's diversity\n(Recon)", x = "") +
  stat_summary(fun = mean, 
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 40) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "recon_simpson_index.pdf"), width = 5, height = 3)

8 Comparing repertoires

8.1 Calculating average SHM per database

ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls)]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
  select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
  mutate(
    Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
    ID = gsub("_.*", "", ID_timepoint),
    database = plyr::mapvalues(.id,
      from = c("IMGTrm", "cirelli", "nestor-rm", "individualized"),
      to = c("IMGTrm", "Cirelli et al.", "KIMDB", "Individualized")
    ),
    database = factor(database, levels = c("IMGTrm", "Cirelli et al.", "KIMDB", "Individualized")),
    Group = plyr::mapvalues(
      ID, c(
        "E11", "E16", "E17", "E23", "E24",
        "E12", "E14", "E18", "E21"
      ),
      c(
        "20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
        "1-mer", "1-mer", "1-mer", "1-mer"
      )
    ),
    cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
    v_call = ifelse(is.na(v_call), V_gene, v_call),
    j_call = ifelse(is.na(j_call), J_gene, j_call)
  ) %>%
  group_by(.id, ID_timepoint) %>%
  distinct(new_name, .keep_all = TRUE) %>%
  ungroup() %>%
  filter(database %in% c("KIMDB", "Individualized"))

rds_merge_stat <- rds_merge %>%
  group_by(database) %>%
  summarise(SHM_mean = mean(V_SHM), SHM_sd = sd(V_SHM), size = n())

rds_merge_stat
## # A tibble: 2 × 4
##   database       SHM_mean SHM_sd   size
##   <fct>             <dbl>  <dbl>  <int>
## 1 KIMDB              4.58   3.43 194554
## 2 Individualized     4.68   3.54 238805
fwrite(rds_merge_stat, paste0("../", result.dir, "merged_databases_shm.csv"), row.names = FALSE, sep = ",")

8.2 Plot number of sequences per animal

full_rep_indiv <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")

full_rep_indiv <- full_rep_indiv %>%
  filter(timepoint != "PV") %>%
  distinct(new_name, .keep_all = TRUE) %>%
  group_by(timepoint, ID) %>%
  summarise(aligned_seq = n())

full_rep_indiv %>%
  filter(timepoint != "Single-cell") %>%
  write.csv(paste0("../", result.dir, "repertoire_depth.csv"),row.names = FALSE)

plot_depth <- full_rep_indiv %>%
  filter(timepoint != "Single-cell") %>%
  mutate(timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
  ggplot(aes(y = aligned_seq, x = timepoint, grp = timepoint)) +
  geom_boxplot(outlier.shape = NA, lwd=1) +
  geom_jitter(aes(color = ID), size = 5) +
  ggpubr::stat_compare_means(paired = TRUE, method = "wilcox.test", label = "p.format", p.adjust.method = "fdr") +
  ggpubr::stat_compare_means(method = "kruskal.test") +
  labs(x = "Timepoint", y = "# aligned sequences") +
  scale_color_viridis_d() +
   ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)

plot_depth

ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)

8.3 Generate data for clone size per animal per database

rds_summary <- rds_merge %>%
  group_by(database, ID_timepoint, grp) %>%
  summarise(
    ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
    clonal_size = n(),
    database,
    sc_clone_grp = first(sc_clone_grp),
    V_errors = mean(V_errors),
    specificity_group = first(specificity_group),
    cdr3_aa_length = mean(cdr3_aa_length),
    v_call = first(v_call),
    j_call = first(j_call)
  ) %>%
  ungroup() %>%
  group_by(database, ID_timepoint) %>%
  mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
  ungroup() %>%
  distinct()

rds_summary_noPV <- rds_summary %>%
  filter(Timepoint != "PV", Timepoint != "Single-cell")

rds_summary_save <- rds_summary %>%
  filter(Timepoint %in% c("Single-cell", "B1","B2"),
         database == "Individualized") %>%
  group_by(ID_timepoint) %>%
  summarise(mean_clonal_size = mean(clonal_size),
            median_clonal_size = median(clonal_size),
            geom_clonal_size = exp(mean(log(clonal_size))),
            unique_clones = sum(clonal_size == 1),
            total_clones_detected = n(),
            percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2)) 

rds_summary_save %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
ID_timepoint mean_clonal_size median_clonal_size geom_clonal_size unique_clones total_clones_detected percentage_unique_clones
E11_B1 73.765766 24.0 23.618580 7 111 6.31
E11_B2 65.842324 18.0 16.139896 22 241 9.13
E11_Single-cell 1.596491 1.0 1.310902 169 228 74.12
E12_B1 96.802632 8.5 11.635387 5 76 6.58
E12_B2 62.696296 9.0 9.733452 18 135 13.33
E12_Single-cell 1.556000 1.0 1.304827 185 250 74.00
E14_B1 79.458333 9.0 9.683393 11 96 11.46
E14_B2 111.530612 10.5 12.335661 29 196 14.80
E14_Single-cell 2.041096 1.0 1.527244 139 219 63.47
E16_B1 73.535519 14.0 15.223779 14 183 7.65
E16_B2 118.491228 31.5 29.123260 17 228 7.46
E16_Single-cell 1.996094 1.0 1.489967 169 256 66.02
E17_B1 38.460526 8.0 9.753009 8 76 10.53
E17_B2 55.614035 6.0 7.902730 31 171 18.13
E17_Single-cell 1.665158 1.0 1.381309 151 221 68.33
E18_B1 41.323529 11.0 11.772174 9 102 8.82
E18_B2 72.662921 21.0 15.643111 23 178 12.92
E18_Single-cell 1.641026 1.0 1.368241 135 195 69.23
E21_B1 35.201835 5.0 6.887328 21 109 19.27
E21_B2 45.202312 14.0 13.455144 16 173 9.25
E21_Single-cell 1.674208 1.0 1.385378 151 221 68.33
E23_B1 59.675497 9.0 11.417886 19 151 12.58
E23_B2 63.044025 13.0 14.180265 18 159 11.32
E23_Single-cell 1.916279 1.0 1.484490 136 215 63.26
E24_B1 55.469231 10.0 11.115858 18 130 13.85
E24_B2 62.310735 9.0 10.429075 29 177 16.38
E24_Single-cell 1.829897 1.0 1.400170 137 194 70.62
write.csv(rds_summary_save, 
          paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)

8.4 Correlation clone size per animal per database

df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
  for (databases in unique(rds_summary_noPV$database)) {
    rds_timepoint <- rds_summary_noPV %>%
      filter(Timepoint == timepoints, database == databases) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    rds_indiv <- rds_summary_noPV %>%
      filter(
        database == "Individualized",
        Timepoint == timepoints
      ) %>%
      arrange(desc(clonal_size)) %>%
      pull(clonal_size)

    # pad (or truncate) the Individualized vector to the same length; padding NAs become 0 below
    length(rds_indiv) <- length(rds_timepoint)

    df <- data.frame(
      size_indiv = rds_indiv,
      clonal_size = rds_timepoint,
      Timepoint = timepoints,
      database = databases
    )
    df[is.na(df)] <- 0
    df_all <- rbind(df, df_all)
  }
}


df_all %>%
  filter(database != "Individualized") %>%
  mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
  ggplot(aes(x = size_indiv, y = clonal_size)) +
  geom_point(size = 1) +
  geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
  scale_x_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  scale_y_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(1, 10, 100, 1000, 10000),
    labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
  ) +
  labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ Timepoint)

ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)

8.5 IGHV-J pairing per database

for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
  # check if we want to filter clonotypes by size or not
  filter(database == i) %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  mutate(v_call = sub("\\*.*","", v_call),
         j_call = sub("\\*.*|-.*","", j_call),
         clonal_size_log = log10(clonal_size),
         Group = factor(Group, levels = c("20-mer", "1-mer")),
         specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
  group_by(Group) %>%
  mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
  ungroup() %>% 
  group_by(Group, specificity_group, v_call, j_call) %>%
  summarise(clonal_size_perc = sum(clonal_size_perc)) %>%
  filter(specificity_group != "PostF") %>% droplevels()

rds_summary_noPV %>%
  ggplot(aes(x= v_call, y = j_call, fill = clonal_size_perc)) +
    geom_tile(color = "black") +
    scale_fill_viridis_c(option = "viridis", direction = 1) +
    scale_y_discrete(position = "right") +
    ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
    theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
          axis.text.y = element_text(face = "bold", colour = "black", size = 8),
          strip.text = element_text(face = "bold"),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.position =  "top",
          axis.ticks = element_line(size = .5)) +
    labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles") +
    facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")

ggsave(paste0("../", result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}

8.6 IGHV-IGHJ pairing of LOR mAbs

lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")
mabs_lors <- mabs_lors %>%
  filter(name %in% lor_mabs$well_ID) %>%
  mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
  arrange(mAbs_ID) %>%
  relocate(mAbs_ID)

mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
  mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp), 
                                  mabs_lors[mabs_lors$name == i,]$mAbs_ID, 
                                  NA)
}

col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])         
mabs_clonal_rank <- mabs_clonal_rank %>%  
  mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
  select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
  filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
  arrange(LOR) %>%
  mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
  filter(ID == well_ID)
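The column-coalescing step above (one indicator column per mAb, collapsed into a single `LOR` column) can be sketched in base R on hypothetical toy values:

```r
# Toy data: one indicator column per mAb, NA where the clone did not match
toy <- data.frame(
  LOR21 = c("LOR21", NA, NA),
  LOR24 = c(NA, "LOR24", NA)
)

# Collapse the per-mAb columns row-wise: the first non-NA value wins,
# mirroring what coalesce() does across all_of(col_combine)
toy$LOR <- Reduce(function(x, y) ifelse(is.na(x), y, x),
                  toy[c("LOR21", "LOR24")])
toy$LOR
## [1] "LOR21" "LOR24" NA
```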

mabs_lors <- mabs_lors %>% 
  mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
         clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
         clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
  relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)

# Manually fixed: LOR24 (same clone as LOR19), LOR37 (not detected at B2), LOR40 (not detected at B2), LOR42 (same clone as LOR01)
mabs_lors %>% write.csv(paste0("../", result.dir,"LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(paste0("../results/2022-09-23/LOR_mAbs_info.csv"))

mabs_lors %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = substr(name,1,3))) +
  geom_point(size = 3) +
  scale_color_viridis_d() +
  scale_y_log10() +
  scale_x_log10() +
  labs(color = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "clonal_size_vs_clonal_rank.pdf"), width = 5, height = 3)

8.7 Count unique V-J pairs

rds_summary_noPV <- rds_summary %>%
  # If clonal sizes below a threshold should be removed, do it here.
  filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) %>%
  summarise(unique_v_j = n()) %>%
  ungroup()

rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)
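The distinct-pair counting above boils down to pasting the V and J calls together and counting unique combinations; a minimal base-R sketch with hypothetical calls:

```r
# Hypothetical V and J assignments for three sequences
v_call <- c("IGHV1-1", "IGHV1-1", "IGHV3-2")
j_call <- c("IGHJ4",   "IGHJ4",   "IGHJ4")

# Same logic as mutate(v_j_calls = ...) %>% distinct() %>% summarise(n())
v_j_calls <- paste(v_call, j_call, sep = "_")
length(unique(v_j_calls))
## [1] 2
```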

stat.test <- rds_summary_noPV %>%
              wilcox_test(formula = unique_v_j ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
unique_v_j Total_B1_20-mer Total_B1_1-mer 5 4 14 0.413 0.496 ns 137.040 Total_B1…. 1 2
unique_v_j Total_B2_20-mer Total_B2_1-mer 5 4 11 0.901 0.901 ns 150.288 Total_B2…. 3 4
unique_v_j PreF_B1_20-mer PreF_B1_1-mer 5 4 4 0.176 0.264 ns 163.536 PreF_B1_…. 5 6
unique_v_j PreF_B2_20-mer PreF_B2_1-mer 5 4 0 0.016 0.048 * 176.784 PreF_B2_…. 7 8
unique_v_j DP_B1_20-mer DP_B1_1-mer 5 4 20 0.016 0.048 * 190.032 DP_B1_20…. 9 10
unique_v_j DP_B2_20-mer DP_B2_1-mer 5 4 18 0.064 0.127 ns 203.280 DP_B2_20…. 11 12
rds_summary_noPV %>%
  ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j", 
                    fill = "Group", 
                    color = "Group",
                    group = "Group_specificity", 
                    legend = "none", size = 2) +
  geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = fill_col_values) +
  scale_color_manual(values = color_values) +
  labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
  stat_summary(fun = mean, # fun.y is deprecated in current ggplot2
               geom = "crossbar") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)

8.8 Venn diagram of HV-HJ sharing

rds_summary_noPV <- rds_summary %>%
  # decide if clonal size greater than 1 should be included
  filter(database == "Individualized", Timepoint != "Single-cell") %>%
  rbind(., within(., specificity_group <- "Total")) %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
         Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
  group_by(Group) %>%
  mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
  distinct(v_j_calls) 

list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
                 "1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])

ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7, 
                             label_alpha = 0,
                             set_color = "black",
                             set_size = 9) +
  scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
  scale_color_manual(values = c("#5F90B0", "#D896C1")) +
  theme(legend.position = "none")

ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)
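The Venn regions can be sanity-checked with base set operations on the same kind of lists (hypothetical pair labels shown here):

```r
# Hypothetical V-J pair labels per group
np  <- c("IGHV1-1_IGHJ4", "IGHV3-2_IGHJ4", "IGHV4-5_IGHJ6")
sol <- c("IGHV3-2_IGHJ4", "IGHV5-7_IGHJ3")

length(intersect(np, sol))  # shared pairs: 1
length(setdiff(np, sol))    # 20-mer only: 2
length(setdiff(sol, np))    # 1-mer only: 1
```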

8.9 Process selected LORs with individualized database for plotting

rds_indiv <- rds_summary %>%
  filter(
    database == "Individualized",
    Timepoint != "Single-cell",
    Timepoint != "PV"
  ) %>%
  mutate(
    LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
    LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
  )
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
  tidyr::expand(ID, Timepoint, grp) %>%
  filter(Timepoint != "Single-cell")

rds_indiv <- rds_indiv %>%
  right_join(all) %>%
  mutate(
    ID_timepoint = paste(ID, Timepoint, sep = "_"),
    database = "individualized",
    clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
  ) %>%
  arrange(grp, ID_timepoint) %>%
  tidyr::fill(LOR, Group, sc_clone_grp)
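The expand/right_join/fill pattern above guarantees that every ID-timepoint combination is present, with absent clones counted as zero; a base-R equivalent on toy data:

```r
# Observed clones for animal E16 only (toy values)
obs <- data.frame(ID = c("E16", "E16"),
                  Timepoint = c("B1", "B2"),
                  clonal_size = c(3, 5))

# All expected ID-timepoint combinations, including animals with no clones
all_combos <- expand.grid(ID = c("E16", "E18"),
                          Timepoint = c("B1", "B2"),
                          stringsAsFactors = FALSE)

# merge(all.x = TRUE) plays the role of right_join(all);
# missing combinations then get clonal_size = 0
out <- merge(all_combos, obs, all.x = TRUE)
out$clonal_size[is.na(out$clonal_size)] <- 0
nrow(out)             # 4 rows: every combination kept
sum(out$clonal_size)  # 8: observed totals unchanged
```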

8.10 Clone size area plot (cumulative) of 20-mer and 1-mer, divided by specificity (DP & PreF)

NP_cum_test <- rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  ungroup() %>%
  tidyr::pivot_wider(values_from = clonal_size, names_from = Timepoint) %>%
  filter(Group == "20-mer") %>%
  select(-Group) %>% as.matrix()
rownames(NP_cum_test) <- NP_cum_test[,1]
NP_cum_test <- NP_cum_test[-c(2,4),]
NP_cum_test <- apply(NP_cum_test[,-1], FUN = as.numeric, MARGIN = c(1,2))


sol_cum_test <- rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  ungroup() %>%
  tidyr::pivot_wider(values_from = clonal_size, names_from = Timepoint) %>%
  filter(Group == "1-mer") %>%
  select(-Group) %>% as.matrix()
rownames(sol_cum_test) <- sol_cum_test[,1]
sol_cum_test <- sol_cum_test[-c(2,4),]
sol_cum_test <- apply(sol_cum_test[,-1], FUN = as.numeric, MARGIN = c(1,2))

## Chi-square test for nanoparticle group
chisq.test(NP_cum_test, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  NP_cum_test
## X-squared = 73.484, df = 1, p-value < 2.2e-16
## Chi-square test for soluble group
chisq.test(sol_cum_test, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  sol_cum_test
## X-squared = 726.86, df = 1, p-value < 2.2e-16
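Each of the two tests above compares summed clonal sizes in a 2x2 specificity-by-timepoint table; a minimal sketch with hypothetical counts:

```r
# Hypothetical 2x2 table mirroring the NP_cum_test layout:
# rows = specificity (DP, PreF), columns = timepoint (B1, B2)
tab <- matrix(c(120, 340,
                200, 310),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("DP", "PreF"), c("B1", "B2")))

res <- chisq.test(tab, correct = FALSE)
res$parameter  # df = 1 for a 2x2 table, as in the tests above
```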
rds_indiv %>%
  group_by(Group, Timepoint, specificity_group) %>%
  summarise(clonal_size = sum(clonal_size)) %>%
  tidyr::drop_na() %>%
  filter(specificity_group != "PostF") %>%
  ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
  geom_area(color = "black") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
  scale_x_discrete(expand = c(0.02, 0.02)) +
  scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
  labs(y = "Clonal size") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~Group)

ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)

8.11 Plot SHM over time for individualized database

Data are divided by group (20-mer, 1-mer) and by specificity.

rds_indiv_total <- rds_indiv %>%
  mutate(specificity_group = "Total")

rds_indiv <- rbind(rds_indiv, rds_indiv_total)


rds_indiv_plot <- rds_indiv %>%
  filter(specificity_group != "PostF") %>%
  mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
  mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup() 

stat.test <- rds_indiv_plot %>%
              wilcox_test(formula = V_errors ~ Group_specificity, 
                          comparisons = my_comparisons, 
                          p.adjust.method = "fdr", 
                          paired = FALSE) %>%
  add_xy_position()

stat.test %>%
  kable %>%
  kable_styling("striped") %>% 
 scroll_box(height = "200px")
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
V_errors Total_B1_20-mer Total_B1_1-mer 651 383 119282.5 0.246 0.492 ns 68.320 Total_B1…. 1 2
V_errors Total_B2_20-mer Total_B2_1-mer 976 682 349160.0 0.088 0.265 ns 73.504 Total_B2…. 3 4
V_errors PreF_B1_20-mer PreF_B1_1-mer 205 205 21279.5 0.824 0.841 ns 78.688 PreF_B1_…. 5 6
V_errors PreF_B2_20-mer PreF_B2_1-mer 346 389 69762.5 0.391 0.587 ns 83.872 PreF_B2_…. 7 8
V_errors DP_B1_20-mer DP_B1_1-mer 413 144 29401.0 0.841 0.841 ns 89.056 DP_B1_20…. 9 10
V_errors DP_B2_20-mer DP_B2_1-mer 583 250 79353.0 0.042 0.251 ns 94.240 DP_B2_20…. 11 12
rds_indiv_plot %>%
  ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity", 
                    legend = "none") +
  geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
  geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
  scale_shape_manual(values=NA)+
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
  labs(y = "# IGHV nucleotide mutations",  x= "") +
  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
  ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
  theme(axis.ticks = element_line(size = .5),
        legend.position = "none")

ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)

8.12 Plot HCDR3 length over time for individualized database divided by 20-mer and 1-mer, and specificities

rds_indiv %>%
  filter(specificity_group != "PostF") %>%
  mutate(
    Group_specificity = paste(Group, specificity_group, sep = "_"),
    specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
    Timepoint = factor(Timepoint, levels = c("B1", "B2"))
  ) %>%
  ggplot(aes(x = cdr3_aa_length, fill = Group)) +
  geom_density(alpha = .7) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
  labs(y = "Density", x = "HCDR3 aa length") +
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5)) +
  facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")

ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)

9 Comparison of discovered alleles (individualized database)

9.1 Plot V alleles unique and shared per animal, stacked plot

ls <- list.files("../data/databases/individualized", recursive = T, full.names = T, pattern = "V.fasta")
names(ls) <- basename(dirname(ls))
ls <- ls[-1] # remove combined V genes
individualized_dbs <- lapply(ls, Biostrings::readDNAStringSet)
size_indv_db <- data.frame(
  v_count = unlist(lapply(individualized_dbs, length)),
  type = "Shared"
) %>%
  tibble::rownames_to_column("ID")
individualized_dbs <- unlist(Biostrings::DNAStringSetList(individualized_dbs))
duplicated_names <- c(names(unique(individualized_dbs[length(individualized_dbs):1, ])), names(unique(individualized_dbs)))
individualized_dbs_sel <- individualized_dbs[setdiff(names(individualized_dbs), duplicated_names)]
size_unique_db <- data.frame(
  v_unique = table(substr(names(unique(individualized_dbs_sel)), 1, 3)),
  type = "Unique") %>%
  dplyr::rename(v_count = v_unique.Freq,
         ID = v_unique.Var1)


unique_v_indiv <- rbind(size_indv_db, size_unique_db) %>%
  add_row(ID = c("E11", "E24"), v_count = rep(0), type = rep("Unique")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )
unique_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_unique_shared_per_animal.csv"), row.names = FALSE)
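The lag-based `diff` above converts each animal's totals into stack heights, so the "Unique" and "Shared" segments add back to the full allele count; the arithmetic on one toy animal:

```r
# One toy animal: 80 alleles in its individualized database, 5 of them unique
shared_total <- 80  # rows of type "Shared" hold the full database size
unique_count <- 5   # rows of type "Unique" hold the novel-allele count

# diff = v_count - lag(v_count): the "Shared" segment is shrunk by the
# unique count so that the two stacked segments sum to the total
stack <- c(Shared = shared_total - unique_count, Unique = unique_count)
stack
sum(stack)  # 80: stacks back to the database size
```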

unique_v_indiv %>%
  mutate(type = factor(type, levels = c("Unique", "Shared"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V allele counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "unique_and_shared_alleles.pdf"), width = 8, height = 6.38)

9.2 Plot V alleles validated or not per animal, stacked plot

kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")

joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
##      width seq                                              names               
##  [1]   296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
##  [2]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
##  [3]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [4]   293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
##  [5]   298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
##  ...   ... ...
## [11]   296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12]   299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13]   296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14]   299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15]   297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
  v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
  type = "Not validated"
) %>%
  dplyr::rename(
    v_count = v_unique.Freq,
    ID = v_unique.Var1
  )
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
  add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
  group_by(ID) %>%
  arrange(ID, .by_group = TRUE) %>%
  mutate(
    diff = v_count - lag(v_count, default = last(v_count)),
    diff = ifelse(diff < 0, v_count, diff)
  )

kimdb_v_indiv %>%
  write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)

kimdb_v_indiv %>%
  mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
  ggplot(aes(x = ID, y = diff, fill = type)) +
  geom_col(color = "black") +
  scale_fill_viridis_d(direction = -1) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
  scale_x_discrete(expand = c(0, 0)) +
  labs(y = "V allele counts", x = "Animal ID") +
  theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(size = .5))

ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)

10 Phylogenetic trees of LOR21 and LOR24

10.1 Processing data

clonal_tree_data <- clonotype_rsv %>%
  ungroup() %>%
  distinct(new_name, .keep_all = TRUE) %>%
  group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
  add_tally(name = "clonal_size") %>%
  ungroup()

mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03")

# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")

# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")

clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE) 

10.2 Generating fasta files for tree plotting

10.3 Heavy chain

sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA heavy chain
  UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
  UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
  UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
  names(UCA) <- paste0(mab_name,"_UCA")
  # select sequences part of the same clonotype
  group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clonal_tree_data %>%
    filter(grp == group) 
 
  # save csv with B cell lineage info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
            row.names = FALSE)
  }
  # deduplicating sequences
  fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
  
  #adding single cell sequences
  sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
  
  #saving duplicated
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
    sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
  
  fasta <- c(fasta, sc_seqs, UCA)
  fasta <- fasta[unique(names(fasta))]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
 
  
}
## [1] 1
## [1] 2
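Inside the loop the UCA is built by concatenating the germline V and J segments; the same operation on plain strings (hypothetical toy sequences, no D segment or junction modelled):

```r
# Hypothetical germline V and J nucleotide segments
germline_v <- "CAGGTCCAGCTG"
germline_j <- "TGGGGCCAAGGG"

# paste0() coerces DNAStringSets to character, so the UCA above is
# simply this naive V+J concatenation
uca <- paste0(germline_v, germline_j)
nchar(uca)  # 24
```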

10.4 Light chain

for(i in seq_along(mabs_of_interests)){
  print(i)
  mab <- mabs_of_interests[i]
  
  if(any(stringr::str_count(names(mab), "LOR") <= 1)){
    mab <- mabs_of_interests[i]
    mab_name <- names(mab)
  }
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
    mab_name <- names(mab)}
  # create UCA light chain
  UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
  UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
  UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
  names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
  # select light chain clonotypes
  group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
  filtered <- clono_light_chains %>%
    filter(grp == group, substr(name,1,3) == substr(mab,1,3)) 
 
  # save csv with B cell lineage info
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
            row.names = FALSE)
  }else{
  write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
            row.names = FALSE)
  }
  # save object as BStringset
  fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
  
  # save duplicated values
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }else{
    sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
    }
  
  fasta <- c(fasta, UCA_LC)
  fasta <- fasta[unique(names(fasta))]
  
  if(any(stringr::str_count(names(mab), "LOR") > 1)){
    writeXStringSet(fasta, filepath = paste0("../",result.dir,  mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
    }else{
    writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2

10.5 Run MUSCLE on bash

Run a multiple sequence alignment with MUSCLE (v5.1) on every FASTA file generated throughout this analysis. Then run FastTree (v2.1.11) on the aligned sequences and save the tree output to be plotted in the code below.


source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/    
for f in *.fasta; do muscle -align $f -output $f.aln; done

for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 217 seqs, avg length 364, max 370
## 
## 00:00 4.5Mb  CPU has 8 cores, running 8 threads
## (per-second MUSCLE progress output trimmed: Calc posteriors, Consistency, UPGMA5, and Refining all completed)
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 8 seqs, avg length 322, max 326
## 
## 00:00 2.1Mb  CPU has 8 cores, running 8 threads
## (MUSCLE progress output trimmed)
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 118 seqs, avg length 370, max 370
## 
## 00:00 2.9Mb  CPU has 8 cores, running 8 threads
## (MUSCLE progress output trimmed)
## 
## muscle 5.1.osx64 []  34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
## 
## Input: 10 seqs, avg length 331, max 334
## 
## 00:00 2.1Mb  CPU has 8 cores, running 8 threads
## (MUSCLE progress output trimmed)
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.06 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
##       0.11 seconds: SPR round   1 of   2, 101 of 432 nodes
##       0.23 seconds: SPR round   1 of   2, 401 of 432 nodes
##       0.36 seconds: SPR round   2 of   2, 301 of 432 nodes
## Total branch-length 2.197 after 0.42 sec
##       0.50 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.56
##       0.64 seconds: Optimizing GTR model, step 3 of 12
##       0.75 seconds: Optimizing GTR model, step 6 of 12
##       0.86 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
##       0.97 seconds: ML Lengths 101 of 215 splits
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       1.14 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.22
## Turning off heuristics for final round of ML NNIs (converged)
##       1.31 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 1.42 (final)
##       1.42 seconds: ML Lengths 1 of 215 splits
## Optimize all lengths: LogLk = -5298.540 Time 1.47
##       1.57 seconds: ML split tests for    100 of    214 internal splits
##       1.69 seconds: ML split tests for    200 of    214 internal splits
## Total time: 1.71 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.01
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.02
## Total time: 0.02 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.02 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
##       0.10 seconds: ME NNI round 10 of 28, 1 of 116 splits
## Total branch-length 1.596 after 0.18 sec
##       0.25 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.26
##       0.35 seconds: Optimizing GTR model, step 7 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
##       0.46 seconds: ML Lengths 101 of 116 splits
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
##       0.57 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 0.59
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 0.63
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 0.66
## Turning off heuristics for final round of ML NNIs (converged)
##       0.74 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 0.75 (final)
## Optimize all lengths: LogLk = -3814.284 Time 0.78
##       0.89 seconds: ML split tests for    100 of    115 internal splits
## Total time: 0.91 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.00 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.03
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.03 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.03
## Total time: 0.04 seconds Unique: 9/10 Bad splits: 0/6

10.6 Reading and plotting trees

Generated trees are edited using treeio and plotted with ggtree. The trees were rerooted to their respective Unmutated Common Ancestor (UCA), which here is defined as the combined V and J gene germlines. The gap between V and J is inserted automatically by the alignment method, so the CDR3 is not considered part of the UCA.

# function to root a tree at its UCA tip
.to_root_uca <- function(x){
  root(x, which(grepl("UCA", x[["tip.label"]])))
}

ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)

names(ls) <- lapply(ls, function(x) {
  if (stringr::str_count(x, "LOR") > 1) {
    substr(x, 24, 34)
  } else if (grepl("LC", x)) {
    substr(x, 24, 31)
  } else if (grepl("LOR", x)) {
    substr(x, 24, 28)
  } else {
    substr(x, 24, 33)
  }
})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)

for(i in seq_along(plots)){
  print(i)
  plot_name <- names(plots)[i]
  
  plots_edit <- plots[[i]]$data %>%
    mutate(new_label = ifelse(grepl("sc_|LOR",label), gsub("sc_", "",label), ""),
           new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
                                       warn_missing = FALSE),
           new_label = ifelse(grepl("LOR",new_label), new_label, ""),
           name = label,
           timepoint = case_when(grepl("B2-igg",name) ~ "B2",
                                 grepl("B1-igg",name) ~ "B1",
                                 grepl("sc_|LOR",name) ~ "single_cell",
                                 grepl("igm",name) ~ "PV",
                                 TRUE ~ "intersects"),
           name = gsub("sc_", "", label))
  
  
  if (stringr::str_count(plot_name, "LOR") > 1) {
    plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name, 1)]], by = "name")
  } else {
    plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name")
  }
 # shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
  colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")

  gg_plot <- ggtree(plots_edit,aes(color = timepoint)) + 
    {if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
      else geom_tippoint(aes(size = duplicated), shape = 18)}+
   # geom_tippoint(aes(size = duplicated), shape = 23) +
    geom_tiplab(aes(label = new_label), hjust = -.2) +
    labs(size = "Count", shape = "Shape", color = "Timepoint") + 
    scale_color_manual(values = colors_timepoints) +
    coord_cartesian(clip = 'off') + 
   # scale_shape_manual(values = shapes_timepoints) +
    scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
    geom_treescale(width = .05)

  gg_plot
  
   ggsave(gg_plot,
          filename = paste0("../", result.dir, names(ls)[[i]], "_tree.pdf"),
          width = ifelse(grepl("LC", plot_name), 3, 4),
          height = ifelse(grepl("LC", plot_name), 3, 6))
  }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
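As an aside, the substr() calls above extract clone names using fixed character offsets, which silently break if the path prefix changes length. A hedged alternative sketch, assuming the same `<name>.fasta.aln.tre` file naming (the paths below are made up for illustration):

```r
# Hypothetical paths mimicking the generated tree files
ls <- c("../results/LOR21.fasta.aln.tre",
        "../results/LOR24_LC.fasta.aln.tre")

# Derive names from the basename instead of fixed offsets:
# drop the directory, then strip the ".fasta.aln.tre" suffix.
names(ls) <- sub("\\.fasta\\.aln\\.tre$", "", basename(ls))
names(ls)  # "LOR21" "LOR24_LC"
```

Because the names come from the file basename, moving the results directory does not change them.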

12 Heatmap of competition between LOR mAbs

LOR mAbs showed diverse competition profiles against different standardized mAbs when binding the PostF and PreF RSV fusion proteins.

# Edited the raw values so the max value is the reference competition for ADI, MPE8, 101F, and D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-03-10_LOR_norm-dAUC.csv", header = T, row.names = 1) 


data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)

# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF")

{
  g_heatmap_scale <- data_comp_auc %>%
    select(-all_of(c(to_remove))) %>%
    mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
    t() %>%
    pheatmap::pheatmap(scale = "none", cutree_cols = 8, angle_col = 90,
                       fontsize_row = 8, clustering_method = "ward.D",
                       color = viridisLite::viridis(100), cluster_rows = FALSE)
 
  ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 24, height = 10, units = "cm")
}

12.1 Multidimensional scaling of competition between LOR mAbs

This MDS is another way to visualize the same data shown in the heatmap above. Here, the MDS input is the Euclidean distance matrix. The data were scaled and centered prior to calculating the Euclidean distance, which is the simplest and most widely used distance metric.
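As a minimal, self-contained sketch of the same pipeline (scale and center, Euclidean distance, classical MDS via cmdscale()) on toy data, assuming nothing from the competition dataset:

```r
# Toy data: three collinear points, so one MDS dimension suffices
m <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

# Same steps as below: scale/center, Euclidean distance, classical MDS
d   <- dist(scale(m), method = "euclidean")
mds <- cmdscale(d, k = 1)

# Classical MDS on Euclidean distances reproduces the pairwise
# distances exactly when the embedding dimension is sufficient
all.equal(c(dist(mds)), c(d))
```

The same guarantee does not hold once the embedding dimension (here k = 2 for the competition data) is smaller than the intrinsic dimensionality, in which case the configuration is only an approximation.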

# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c("D25", "101F", "Pali", "MPE8", "ADI", paste0("LOR", sprintf("%02d", 1:106))))

data_comp_auc_neut <- data_comp_auc %>% tibble::rownames_to_column("mAb") 

# compute MDS
mds_scaled <- data_comp_auc %>%
  scale(center = TRUE, scale = TRUE) %>%
  dist(method = "euclidean") %>%          
  cmdscale() %>%
  as_tibble()

colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
  
p1 <- mds_scaled %>%
  ggplot(aes(Dim.1,Dim.2, label = name, color = name)) +
  geom_point(size = 2) +
  ggrepel::geom_text_repel(max.overlaps = 5) +
  scale_color_manual(values = c("#E64B35FF","#7E6148FF","#00A087FF","#3C5488FF", "#F39B7FFF",rep("#4DBBD5FF",100)))+
  ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
  theme(axis.ticks = element_line(linewidth = .5),
        aspect.ratio = 1,
        legend.position = "none") 
  
plotly::ggplotly(p1)
ggsave(width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-plus-binding.pdf"))

13 Take environment snapshot

renv::snapshot()

14 SessionInfo

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4    dplyr_1.1.0         ggprism_1.0.4      
##  [4] scales_1.2.1        vegan_2.6-4         lattice_0.20-45    
##  [7] permute_0.9-7       data.table_1.14.8   Biostrings_2.66.0  
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0      IRanges_2.32.0     
## [13] S4Vectors_0.36.0    BiocGenerics_0.44.0 ggpubr_0.6.0       
## [16] ggplot2_3.4.1       rstatix_0.7.2       ggtree_3.6.0       
## [19] treeio_1.22.0       tidyr_1.3.0         jsonlite_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-162           bitops_1.0-7           sf_1.0-7              
##  [4] RColorBrewer_1.1-3     webshot_0.5.4          httr_1.4.5            
##  [7] tools_4.2.2            backports_1.4.1        bslib_0.4.2           
## [10] utf8_1.2.3             R6_2.5.1               KernSmooth_2.23-20    
## [13] DBI_1.1.3              lazyeval_0.2.2         mgcv_1.8-42           
## [16] colorspace_2.1-0       withr_2.5.0            tidyselect_1.2.0      
## [19] compiler_4.2.2         cli_3.6.0              rvest_1.0.3           
## [22] xml2_1.3.3             plotly_4.10.1          labeling_0.4.2        
## [25] sass_0.4.5             classInt_0.4-9         proxy_0.4-27          
## [28] systemfonts_1.0.4      stringr_1.5.0          digest_0.6.31         
## [31] yulab.utils_0.0.6      rmarkdown_2.20         svglite_2.1.1         
## [34] pkgconfig_2.0.3        htmltools_0.5.4        fastmap_1.1.1         
## [37] highr_0.10             htmlwidgets_1.6.1      rlang_1.0.6           
## [40] rstudioapi_0.14        ggVennDiagram_1.2.2    gridGraphics_0.5-1    
## [43] jquerylib_0.1.4        generics_0.1.3         farver_2.1.1          
## [46] crosstalk_1.2.0        car_3.1-1              RCurl_1.98-1.10       
## [49] magrittr_2.0.3         ggplotify_0.1.0        GenomeInfoDbData_1.2.9
## [52] patchwork_1.1.2        Matrix_1.5-3           Rcpp_1.0.10           
## [55] munsell_0.5.0          fansi_1.0.4            ape_5.7               
## [58] abind_1.4-5            lifecycle_1.0.3        stringi_1.7.12        
## [61] yaml_2.3.7             carData_3.0-5          MASS_7.3-58.3         
## [64] zlibbioc_1.44.0        plyr_1.8.8             grid_4.2.2            
## [67] ggrepel_0.9.3          parallel_4.2.2         crayon_1.5.2          
## [70] splines_4.2.2          knitr_1.42             pillar_1.8.1          
## [73] ggsignif_0.6.4         glue_1.6.2             evaluate_0.20         
## [76] ggfun_0.0.9            vctrs_0.5.2            gtable_0.3.1          
## [79] purrr_1.0.1            cachem_1.0.7           xfun_0.37             
## [82] broom_1.0.4            e1071_1.7-13           tidytree_0.4.2        
## [85] class_7.3-21           RVenn_1.1.0            viridisLite_0.4.1     
## [88] pheatmap_1.0.12        tibble_3.2.0           aplot_0.1.10          
## [91] units_0.8-1            cluster_2.1.4          ellipsis_0.3.2